12. PPO summary
So that’s it! We can finally summarize the PPO algorithm (a minimal code sketch follows the list below):
1. First, collect some trajectories based on the current policy \pi_\theta, and initialize \theta'=\theta
2. Next, compute the gradient of the clipped surrogate function using the trajectories
3. Update \theta' using gradient ascent: \theta'\leftarrow\theta' +\alpha \nabla_{\theta'}L_{\rm sur}^{\rm clip}(\theta', \theta)
4. Repeat steps 2-3 without generating new trajectories. Typically, steps 2-3 are only repeated a few times
5. Set \theta=\theta', go back to step 1, and repeat.
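To make the loop above concrete, here is a minimal PyTorch sketch of the update procedure. The policy network, the dummy trajectory tensors, and the `clipped_surrogate` helper are illustrative assumptions rather than code from this lesson; a real agent would collect states, actions, and (normalized) future rewards from an environment.

```python
# Minimal sketch of the PPO update loop (illustrative; dummy data stands in for real trajectories).
import torch
import torch.nn as nn

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective L_sur^clip(theta', theta)."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # pi_theta' / pi_theta
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)  # clip the ratio to [1-eps, 1+eps]
    # Take the pessimistic (element-wise minimum) bound, averaged over the batch
    return torch.min(ratio * advantages, clipped * advantages).mean()

# Hypothetical policy: maps a 4-dim state to probabilities over 2 actions
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for iteration in range(100):
    # Step 1: collect trajectories with pi_theta (dummy tensors used here)
    states = torch.randn(64, 4)               # stand-in for trajectory states
    actions = torch.randint(0, 2, (64,))      # stand-in for sampled actions
    advantages = torch.randn(64)              # stand-in for (normalized) future rewards

    with torch.no_grad():                     # action probabilities under the old policy pi_theta
        old_log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))

    # Steps 2-4: a few gradient-ascent updates of theta' on the same trajectories
    for _ in range(4):
        new_log_probs = torch.log(policy(states).gather(1, actions.unsqueeze(1)).squeeze(1))
        loss = -clipped_surrogate(new_log_probs, old_log_probs, advantages)  # ascent = minimize the negative
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Step 5: theta <- theta' happens implicitly, since old_log_probs are recomputed next iteration
```

The inner loop of 4 updates reflects the point above that steps 2-3 are only repeated a few times before new trajectories are collected.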
The details of PPO were originally published by the team at OpenAI, and you can read their paper through this link.